Set resource limit for addon containers #10653
Conversation
…data collected by kubernetes#10335. Please note that both influxdb and heapster could be oom-killed due to memory leakage here.
… on data collected by kubernetes#10335
How often do we expect heapster and influxdb to restart, then? And these numbers are for how many pods? Do we have scaling guidance for users? If we want to ensure that all addons have limits set, we'll need a check, but that can be done later -- file an issue for that. That would also mean that elasticsearch and kibana would need to get memory limits. Otherwise, LGTM
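For illustration, here is a minimal sketch of what a memory limit on the kibana addon could look like. The pod name, container name, image, and the 100Mi value are all placeholders for illustration, not values chosen in this PR:

apiVersion: v1
kind: Pod
metadata:
  name: kibana-logging          # hypothetical pod name, for illustration only
  namespace: kube-system
spec:
  containers:
  - name: kibana-logging        # hypothetical container name
    image: example/kibana:1.0   # placeholder image
    resources:
      limits:
        cpu: 100m               # matches the 100m CPU default discussed in this PR
        memory: 100Mi           # illustrative value; a real limit should come from measurement

The same stanza would apply to the elasticsearch container, with limits picked from soak-test data as done elsewhere in this PR.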
#10077 was merged. The UI will need limits, presumably.
GCE e2e build/test passed for commit 6b61918.
If we're going to run fluentd on the master (#10597), it should definitely have resource limits. As you mention, it should be safe to be oom-killed.
@bgrant0607 The problem I ran into today is that both heapster and influxdb don't have any limit, so they will eventually be oom-killed when the system OOMs; no, I don't have that restart-frequency number yet. I expect to get this PR in so that I can collect that number. The numbers I picked here are based on 4 nodes with 100 containers per node (30 pods, with 2 containers each). We can write scaling guidance for users based on that.
Can we fix the UI in a separate PR? I really want to start collecting heapster and influxdb stats here as early as possible. @a-robinson fluentd has a slight memory leak over time; I will send another PR to resolve the fluentd issue soon.
Ack, thanks. Its memory leak would seem to make it even more important to put a cap on its memory usage.
@a-robinson I pushed an extra commit to update the fluentd manifest file too. For the UI, we have to wait for measurements, hence the separate PR. :-)
@@ -9,6 +9,7 @@ spec:
     resources:
       limits:
         cpu: 100m
+        memory: 200Mi
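Read in context, the hunk above adds a memory limit alongside an existing cpu limit inside a container spec. A rough sketch of the surrounding manifest, with the container name and image as placeholders since the hunk does not show them:

spec:
  containers:
  - name: fluentd-logging            # placeholder; not shown in the hunk
    image: example/fluentd:latest    # placeholder image
    resources:
      limits:
        cpu: 100m                    # pre-existing limit (context line in the hunk)
        memory: 200Mi                # the limit this PR adds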
Isn't 100Mi more than enough given that in all of Saad's tests it never used that much?
No, Saad's report in that issue covers only a couple of hours. The limit I chose here is based on over 2 days. Both of us are running the soak test against our default configurations.
To be clear, my report was over the course of 24 hours. It looks like the fluentd container with the elasticsearch plugin has a leak, because memory usage continues to grow. After two days it hit 151 MB. See #10335 (comment)
Why is the memory limit set for fluentd-gcp but not fluentd-es?
Sorry for all the comments, but you should also add limits to cluster/saltbase/salt/fluentd-es/fluentd-es.yaml.
LGTM
GCE e2e build/test passed for commit 54531d9.
@a-robinson As mentioned above, cluster/saltbase/salt/fluentd-es/fluentd-es.yaml is not used by GKE and GCE, so I would rather not choose limits for it; my concern is reducing schedulability for those small cells.
Since I have LGTMs from two TLs now, I marked it ok-to-merge. The reason I am rushing this in is that we are missing important data for heapster and influxdb, for example restart frequency.
@zmerlynn or @nikhiljindal Can I get a merge for this one? I am going to monitor Jenkins tonight and tomorrow. Thanks!
LGTM
I'm watching tonight as well.
Set resource limit for addon containers
Thanks, everyone! You guys are super!
Filed #10760 for the memory leakage of heapster and influxdb.
Measurements are based on the current default configuration for GCE / GKE (?) and AWS: 4 nodes, 30~50 pods per node.
For all containers, I allocate 100m of CPU, even though they all use less than that, since it matches today's default setting from the LimitRange of the default namespace; a sketch of such a LimitRange follows. Memory is more complicated, and the values here are based on the measurements discussed above.
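For reference, a minimal sketch of a LimitRange that would supply that 100m CPU default for containers in the default namespace; the object name is a placeholder, and this is illustrative rather than the exact object shipped with the cluster:

apiVersion: v1
kind: LimitRange
metadata:
  name: limits            # placeholder name
  namespace: default
spec:
  limits:
  - type: Container
    default:
      cpu: 100m           # default CPU limit applied to containers that set none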
cc/ @bgrant0607 @thockin @saad-ali @a-robinson
@eparis and @justinsb too on the vagrant case.